Homework1

Please upload to Brightspace a .zip archive containing your Jupyter Notebook with solutions and all data required to reproduce your solutions.

Please also prepare a requirements.txt file which lists all the packages that you have used for your homework, one package per line. This will allow us to install all required packages.

Please name your .zip archive using your full name and student id as follows - Firstname_Lastname_12345678_COMP47350_Homework1.zip.

For your Notebook, please split the code and explanations into many little cells so it is easy to see and read the results of each step of your solution. Please remember to name your variables and methods with self-explanatory names. Please remember to write comments and where needed, justifications, for the decisions you make and code you write. Feel free to revisit tips_to_keep_your_ipython_notebook_readable_and_easy_to_debug.html provided on Brightspace.

Your code and analysis are like a story waiting to be read, so please make it a nice one. Always start with an introduction about the problem, your understanding of the problem domain, and the data analytics solution, and describe the steps you take and your findings from each step.

The accepted file formats for the homework are:

- .ipynb
- .zip
- .pdf
- .csv

Please keep the whole code in a single notebook. Usage of external tools/files is discouraged for portability reasons. Files in any format other than those mentioned above can be included but will be ignored and not considered for the submission (including .doc, .rar, .7z, .pages, .xlsx, .tex, etc.). Any image format is allowed as long as the images appear embedded in your report (.ipynb or .pdf or .html).

Deadline: Monday, 8 March, 2021, midnight.

Exercises

This homework focuses on data understanding and preparation for a particular problem and dataset. The data comes from the Centers for Disease Control and Prevention (CDC: https://covid.cdc.gov/covid-data-tracker/). CDC is a USA health protection agency and is in charge of collecting data about the COVID-19 pandemic, and in particular, tracking cases, deaths, and trends of COVID-19 in the United States. CDC collects and makes public deidentified individual-case data on a daily basis, submitted using standardized case reporting forms. In this analysis, we focus on using the data collected by CDC to build a data analytics solution for death risk prediction. CDC collects demographic characteristics, exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and comorbidities. It also includes information on whether the individual survived or not.

The dataset we work with is a sample of the public data released by CDC, where the outcome for the target feature death_yn is known (i.e., either 'yes' or 'no'): https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf

Each student will work with a different subset of the data. The CSV file is named using the format: covid19-cdc-[your-student-number].csv, e.g., covid19-cdc-12345678.csv is the data file for a student with number 12345678. You need to work with the CSV file corresponding to your student number, available from this link: https://drive.google.com/drive/folders/1HiAbcwh7-JT3UXn5qfd4Xo89kfwXISJ9?usp=sharing. There are 4 parts for this homework. Each part has an indicative maximum percentage given in brackets, e.g., part (1) has a maximum of 40% shown as [40].

(1). [40] Prepare a data quality report for your CSV file. Below you have a set of guideline steps to help you in this process.

- Check how many rows and columns your CSV has.

- Print the first and the last 5 rows.

- Convert the features to their appropriate data types (e.g., decide which features are more appropriate as continuous and which ones as categorical types).

- Drop duplicate rows and columns, if any.

- Drop constant columns, if any.

- Save your updated/cleaned data frame to a new csv file.

For the updated CSV and data frame (after column/row removal):

- Prepare a table with descriptive statistics for all the continuous features.

- Prepare a table with descriptive statistics for all the categorical features.

- Plot histograms for all the continuous features.

- Plot box plots for all the continuous features.

- Plot bar plots for all the categorical features.

- Discuss your initial findings.

- Save the initial discussion of your findings into a single data quality report PDF file.      
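The guideline steps above can be sketched in pandas as follows. The column names and values below are illustrative stand-ins; in the real notebook the frame would come from pd.read_csv on your own covid19-cdc-[student-number].csv file.

```python
import pandas as pd

# Illustrative stand-in for pd.read_csv("covid19-cdc-12345678.csv")
df = pd.DataFrame({
    "cdc_case_earliest_dt": ["2020/03/01", "2020/04/15", "2020/05/20"],
    "age_group": ["20 - 29 Years", "60 - 69 Years", "80+ Years"],
    "death_yn": ["No", "No", "Yes"],
})

n_rows, n_cols = df.shape            # how many rows and columns the CSV has
first_5, last_5 = df.head(), df.tail()

# Drop duplicate rows and constant (single-valued) columns, if any
df = df.drop_duplicates()
df = df.loc[:, df.nunique(dropna=False) > 1]

# Descriptive statistics for the categorical features
categorical_stats = df.describe(include="object")
```

Continuous columns would be summarised with `df.describe(include="number")` and plotted with `df.hist()` / box plots in the same way.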

The PDF report should focus on the key issues identified in the data and discuss potential strategies to handle them. Simple listing of tables and plots without discussion of findings will not receive full marks. To receive full marks you need to show your understanding and analysis of the dataset, through a detailed discussion for each feature. The report should be concise and complete, the goal is not to make it long for the sake of length, but to cover all the important aspects of the features.

(2). [30] Prepare a data quality plan for the cleaned CSV file.

- Mark down all the features where there are potential problems or data quality issues.

- Propose solutions to deal with the problems identified. Explain why you chose one solution over the potentially many others. It is very important to provide justification for your thinking in this part and to list potential solutions, including the solution that will be implemented to clean the data.

- Apply your solutions to obtain a new CSV file where the identified data quality issues were addressed. 

- Save the new CSV file with a self explanatory name. 

- Save the data quality plan to a single PDF file.

(3). [15] Exploring relationships between feature pairs:

- Choose a subset of features you find promising and plot pairwise feature interactions (e.g., continuous-continuous feature plots, continuous-categorical plots, or correlation plots). Explain your choices.

- Discuss your findings from the plots above. Do you find any features or feature combinations that are indicative of the target outcome? Explain in plain words (a short paragraph) the story of your findings so far.

(4). [15] Transform, extend or combine the existing features to create a few new features (at least 3) with the aim to better capture the problem domain and the target outcome. Justify the steps and choices you are making. Add these features to your clean dataset and save it as a CSV file with a self explanatory name.

Data Analytics - Data Preparation and Understanding - HW01

Audit

Author: ARyan - 14395076

Module: COMP47350

DC: 2021-02-08

DLM: 2021-02-08

Desc: This file contains an analysis of the COVID19 data set and produces a data quality report.

Dict: The Data Dictionary for the Data Set is available at: https://www.cdc.gov/coronavirus/2019-ncov/downloads/data-dictionary.pdf

Table of Contents

  1. Introduction

  2. Exec Summary and Results

  3. Modules

  4. Constants

  5. Ingestion

  6. Cleansing

  7. Automated Analysis Functions

00. Introduction

00.01 Background

COVID-19 is an infectious disease caused by SARS-CoV-2, a coronavirus strain first identified in December 2019 following an outbreak in the Chinese city of Wuhan, with the WHO declaring the outbreak a global pandemic in March 2020.

Since its discovery, health organisations have been actively gathering data to assess aspects of the disease including infectivity, symptoms, and mortality rate. Active interest has been paid to factors which may increase a patient's risk of serious symptoms or death.

In this analysis, we focus on using the data collected by the CDC to build an analytics solution for predicting a patient's death risk. The CDC collects demographic characteristics, exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and comorbidities. It also includes information on whether the individual survived or not.

00.02 Problem Scope

We wish to develop a model to predict the risk of a patient dying based on various metrics collected by the CDC.

00.03 Data

The CDC collects demographic data, exposure history, disease severity indicators, outcomes, clinical data, comorbidities, and whether the patient survived. The full data dictionary provided by the CDC is available at the following location: https://www.cdc.gov/coronavirus/2019-ncov/downloads/data-dictionary.pdf

For this assignment, a sample of ten thousand rows are provided from the full dataset available from: https://covid.cdc.gov/covid-data-tracker/

00.04 Approach

The assignment was broadly approached as follows, although the boundaries between steps were not rigid:

  1. Exploratory Data Analysis - Investigating the Data Set Provided.
  2. Data Quality Report - Investigating the Sample Provided for Data Quality Issues.
  3. Data Quality Plan - Developing a plan to address and action issues with data quality.
  4. Extending Data - Adding additional features.
  5. Exploratory Analysis - Comparing the relationships between key feature pairs.

This does not align exactly with the structure of the assignment, as a less rigid approach made more sense for the dataset provided.

00.05 Limitation

As requested in the exercise, the key findings are prepared within the Notebook File and accompanying PDFs.

1. Data Quality Report

0. Background

COVID-19 is an infectious disease caused by SARS-CoV-2, a coronavirus strain first identified in December 2019 following an outbreak in the Chinese city of Wuhan, with the WHO declaring the outbreak a global pandemic in March 2020.

Since its discovery, health organisations have been actively gathering data to assess aspects of the disease including infectivity, symptoms, and mortality rate. Active interest has been paid to factors which may increase a patient's risk of serious symptoms or death.

In this analysis, we focus on using the data collected by the CDC to build an analytics solution for predicting a patient's death risk. The CDC collects demographic characteristics, exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and comorbidities. It also includes information on whether the individual survived or not.

1. Overview

This report will outline the initial findings based on the provided sample of the CDC dataset. It will summarise the data, describe the various data quality issues observed and how they will be addressed.

The appendix includes terminology, assumptions, explanations, and a summary of changes made to the original dataset. It also includes the feature summaries and box plots used to visualise the data.

2. Summary

The following are the key points in relation to the data set and approach:

3. Logical Integrity

As the dataset has a heavy focus on categorical data, the following tests were carried out to assess the integrity of the dataset:

4. Non-Datetime Categorical Features

There are 8 non-datetime categorical features in the dataset:

5. Datetime Categorical Features

There are 4 datetime features in the dataset:

6. Box Plots

Box plots were produced for all categorical data. These are present in the appendix due to the size of the file. All feature pairs and single-feature summaries were calculated as an initial exploration.

8. Note:

The steps provided in the assignment outline a fairly linear process; however, upon reviewing the data I did not believe the outlined process was particularly suitable for this dataset.

In particular, the processing steps outlined suggest the removal of duplicate values prior to data exploration. As I did not believe the records were, in fact, duplicates, but instead were driven by other elements, it was more reasonable to explore the relationships between various factors before taking any steps to drop overlapping rows, in order to better understand why they occur.

Similarly, the steps provided suggest not adding columns until the final section. Due to the nature of the data and the variety of missing values within some of the indicator and date columns, it seemed that valuable information could be obtained from my initial exploration before any final removal occurs. In particular, the onset date column looks to hold key value in relation to the asymptomatic prevalence of COVID and the time between initial presentation and symptom onset. Therefore, before simply dropping this feature for its prevalence of missing values, it was logical to transform this column and add attributes which reflect the data it contained, preserving and enhancing the dataset.

Similarly, the race column contains race and ethnicity combined; however, it can be replaced with the race information alone, as that is sufficient to capture the value of the concatenated field. While the CDC may need, for reporting purposes, to compare Hispanic vs non-Hispanic demographics, reducing the memory usage of the field by stripping the redundant information still allows recovery if this proves insightful.

Due to all of the above, the data quality plan and the data quality actions were, in a sense, completed as a joint process, as proper cleansing of the dataset did not allow for a fully linear process. These steps are detailed below.

2. Data Quality Plan

Based on the initial insights, the following is the data quality plan. Full details on reasoning have been already outlined in the data quality report.

A key note: the author wishes to avoid dropping data as an intermediate step unless it is necessary or the data is directly contradictory. The acquisition cost of data is too significant to justify dropping it until just prior to usage in ML models, as retrieval can be challenging. As such, missing values are in general imputed. The Data Action Dictionary is:

    data_action_dictionary = {
        'cdc_case_earliest_dt': {
            "Data Quality Issues": "515 rows where this is not the minimum of the other populated dates",
            "Data Quality Actions": "Confirm the reason; otherwise leave as-is",
        },
        'cdc_report_dt': {
            "Data Quality Issues": "Deprecated",
            "Data Quality Actions": "Drop",
        },
        'pos_spec_dt': {
            "Data Quality Issues": "72% of data missing",
            "Data Quality Actions": "Drop after using for status correction",
        },
        'onset_dt': {
            "Data Quality Issues": "49% of data missing. <1% of dates where onset_dt is too far after the case date",
            "Data Quality Actions": "Split into days since symptom onset. Flag missing data. Drop the original column. Statistically relevant. Enquire why some values fall so far after the earliest date",
        },
        'current_status': {
            "Data Quality Issues": "Probable cases that should be laboratory-confirmed cases",
            "Data Quality Actions": "Update instances",
        },
        'sex': {
            "Data Quality Issues": "Missing and Unknown flags",
            "Data Quality Actions": "Bin into an Unknown category",
        },
        'age_group': {
            "Data Quality Issues": "Missing and Unknown flags",
            "Data Quality Actions": "Bin into groups",
        },
        'race_ethnicity_combined': {
            "Data Quality Issues": "Concatenated field. Race is sufficient to capture all info",
            "Data Quality Actions": "Split the field and drop ethnicity",
        },
        'hosp_yn': {
            "Data Quality Issues": "Missing, Unknown, and OTH values",
            "Data Quality Actions": "Bin unknowns into groups",
        },
        'icu_yn': {
            "Data Quality Issues": "72% of data missing",
            "Data Quality Actions": "Determine if missing because 'no'. Column is relevant, so await an answer before dropping",
        },
        'death_yn': {
            "Data Quality Issues": "Not applicable",
            "Data Quality Actions": "No action",
        },
        'medcond_yn': {
            "Data Quality Issues": "80% missing",
            "Data Quality Actions": "Group missing values consistently. Column is relevant, so keep until the cause of the missing values is known",
        },
    }


01. Modules for Code

02. Constants

03. Read in CSV

1. Report begins

1.1. Check how many rows and columns your CSV has.

1.2. Print the first and the last 5 rows.

1.3 Convert the features to their appropriate data types (e.g., decide which features are more appropriate as continuous and which ones as categorical types).

Discussion:

Based on initially observing the head and tail, the following are quick observations:

The date format in the _dt columns is 'YYYY/MM/DD'.

cdc_report_dt, pos_spec_dt, and onset_dt contain empty values.

current_status looks to have only one unique value - to be confirmed.

sex - looks like it should be one of the four values in the data dictionary.

age_group - also looks like it should be one of the mapped values, but ensure there are no entry errors and check for unknowns.

race_ethnicity_combined - looks like missing values are unknown.

hosp_yn, icu_yn, medcond_yn - Boolean, but missing values are present - identify whether they are blank or not.

death_yn - confirm 'Yes' or 'No' in all instances.

The data types and values are provided by the CDC data dictionary and have been embedded into the data_dictionary variable. Analysis of the sample data suggests all columns bar the _dt columns are categorical, while the _dt columns are of type datetime (lacking a time component). As an initial step, all columns will be converted to the appropriate type.

Data Conversion Actions:

Based on the data dictionary and the sample data above the following decisions are made with respect to data conversion:

cdc_case_earliest_dt is a datetime of format 'YYYY/MM/DD'.

cdc_report_dt is a nullable datetime of format 'YYYY/MM/DD'. Per the data dictionary this column is deprecated in favour of 'cdc_case_earliest_dt' and should be dropped. This will be actioned later.

pos_spec_dt is a nullable datetime of format 'YYYY/MM/DD'.

current_status is a category, but might be removable.

sex is a category, but the value list needs to be validated.

age_group is a category, but the value list needs to be validated.

race is a category, but the value list needs to be validated.

hosp_yn is a category, but the entries need to be investigated.

icu_yn is a category, but the entries need to be investigated.

death_yn is a category which looks mandatory.

medcond_yn is a category, but the values need to be investigated.
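The conversion decisions above can be sketched as follows. A small illustrative frame stands in for the real data; in the notebook the column lists would be driven from the data_dictionary variable.

```python
import pandas as pd

# Illustrative sample; the real data comes from the student CSV
df = pd.DataFrame({
    "cdc_case_earliest_dt": ["2020/03/01", "2020/04/15"],
    "onset_dt": ["2020/03/03", None],
    "sex": ["Male", "Female"],
    "death_yn": ["No", "Yes"],
})

date_cols = [c for c in df.columns if c.endswith("_dt")]
cat_cols = [c for c in df.columns if c not in date_cols]

# _dt columns -> datetime64 (missing values become NaT);
# everything else -> category
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format="%Y/%m/%d")
df[cat_cols] = df[cat_cols].astype("category")

# Validate the conversion and make the result explicit
assert all(str(df[c].dtype) == "datetime64[ns]" for c in date_cols)
assert all(df[c].dtype.name == "category" for c in cat_cols)
```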

1.3.1 Create a staging dataframe.

The dataflow for industry-implemented data science applications typically follows an ETL-style process:

  1. Input
  2. Staging
  3. Load

A staging dataframe is a copy of the raw dataframe which is used for all subsequent work. Working directly on the raw dataframe is inadvisable: if the original data ever needs to be recovered, reloading it from the source may be slow or impractical. This incurs a memory cost, but for the small dataset we are dealing with this will not be a problem.
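A minimal sketch of the staging step; df_raw stands in for the frame loaded from the CSV:

```python
import pandas as pd

# Stand-in for the frame loaded from the raw CSV
df_raw = pd.DataFrame({"sex": ["Male", "Female"], "death_yn": ["No", "Yes"]})

# All cleansing work happens on a deep copy; the raw frame stays untouched
df_stage = df_raw.copy()
df_stage.loc[0, "sex"] = "Unknown"
```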

1.3.2 Get the column types into appropriate lists from the data dictionary

1.3.3 Convert the columns

1.3.3.1 Validate the conversion

Check that the columns were converted and explicitly outline the result

1.3.4 Explore the Data.

1.3.4.1 Produce The Describe table

Description: Before dropping 'duplicates' or performing any further manipulation, it makes sense to explore the dataframe and identify noteworthy elements. In particular, we will pay attention to 'nullable' columns, identify what values are in each column and the composition of the columns, and note any apparent noteworthy aspects before doing anything further. Duplicates or columns should not be dropped until the data is thoroughly understood, in case there is valuable data or the rows are, in fact, not duplicates.

1.3.4.1.2 Datetime type note.

Internally, pandas stores datetimes as numeric values (nanoseconds since the epoch) while parsing and displaying them as strings.

Discussion:

Population is primarily good outside of the date columns. Oddly, the boolean-like columns have a uniqueness count above what is expected. We will identify the cause later and impute 'missing' values consistently. Only 3% of rows feature a death, 25% of rows are missing a medical condition, and 23% of rows are missing icu_yn.

1.3.4.2 Unique values and frequency.

As part of the initial exploration, we will group all unique values, and provide a dataframe output and a PDF graphing the values.

A sensible starting point is to look first at single columns, and then at all pairwise relations, as an initial exploratory step.

To capture the fact that we have not yet cleansed any data, and as we have not yet fully examined which dates are more prevalent, we will normalise each of the graphs but NOT scale the graph results. As a result, another dimension is added to each graph: the dominance of certain features is captured by how much of the graph a column occupies.

Similarly, as I am not grouping the same column twice for multi-indexed dataframes, these values will be present in the dataframe with '0' due to the usage of the 'category' column type (this is built-in pandas behaviour: all declared categories are shown when grouping, even if not observed). These single-value features should be examined as part of the grouping over single features.
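The category behaviour described above can be seen in a toy example: grouping over a 'category' column reports every declared category, including those never observed in the data.

```python
import pandas as pd

# 'Unknown' is a declared category but never appears in the data
sex = pd.Categorical(["Male", "Male", "Female"],
                     categories=["Male", "Female", "Unknown"])
df = pd.DataFrame({"sex": sex, "n": [1, 1, 1]})

# observed=False keeps the unobserved 'Unknown' group, with a count of 0
counts = df.groupby("sex", observed=False)["n"].count()
```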

1.3.4.2.1 Single Groupings

1.3.4.2.2 Paired Groupings

1.3.4.2.3 Create print statements showing lists with the unique values for validation

1.3.4.2.4 Check that cdc_case_earliest_dt is the minimum date as per the CDC
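One way to sketch this check, assuming the date columns have already been converted to datetimes (the rows below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "cdc_case_earliest_dt": pd.to_datetime(["2020-03-01", "2020-04-10"]),
    "cdc_report_dt": pd.to_datetime(["2020-03-05", "2020-04-01"]),
    "pos_spec_dt": pd.to_datetime(["2020-03-03", None]),
})

# cdc_case_earliest_dt should be <= the minimum of the other populated dates;
# min(axis=1) skips NaT values by default
other_min = df[["cdc_report_dt", "pos_spec_dt"]].min(axis=1)
violations = df[df["cdc_case_earliest_dt"] > other_min]
```

Rows landing in `violations` are the ones to query with the data provider.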

1.3.4.2.5 Investigating Duplicates

1.3.4.2.6 Investigating ICU Admission Without Hospital Admission

1.3.4.2.7 Investigating Probable Cases with Details

1.3.4.2.8 Investigating Missing ICU Proportion

1.3.5. Data Quality Report – Initial - Pre-Alteration.


2. Data Quality Plan

Based on the initial insights, the following is the data quality plan. Full details on the reasoning have already been outlined in the data quality report.

A key note: the author wishes to avoid dropping data as an intermediate step unless it is necessary or the data is directly contradictory. The acquisition cost of data is too significant to justify dropping it until just prior to usage in ML models, as retrieval can be challenging. As such, missing values are in general imputed.

Across the dataset, because the duplicate rows were determined to come primarily from generic and missing values, they are unlikely to be of value to the final model and thus will be dropped. Missing values will be consistently categorised as 'unknown'.

1.3.6 Save as CSV File.

Note: We save to CSV due to the requirements; however, pickling is the 'right' way to save a dataframe. Saving to a CSV file means column types have to be reassigned on subsequent loads, whereas pickled files are designed to preserve that information.
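A small demonstration of the difference, assuming a category-typed column (the file paths are temporary and illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"sex": pd.Categorical(["Male", "Female"])})

with tempfile.TemporaryDirectory() as tmp:
    df.to_csv(os.path.join(tmp, "df.csv"), index=False)
    df.to_pickle(os.path.join(tmp, "df.pkl"))
    from_csv = pd.read_csv(os.path.join(tmp, "df.csv"))      # dtype degrades
    from_pkl = pd.read_pickle(os.path.join(tmp, "df.pkl"))   # dtype preserved
```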

2.1 Read Cleansed File

2.2 Rerun Descriptive Stats and Graph

2.2.2 Rerun Graphs and Pairing

2.3 Consistent Values for Missing
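A sketch of how the assorted missing markers might be collapsed into a single 'Unknown' value. The marker strings below are assumptions about what appears in the data; the real list would come from the unique-value inspection above.

```python
import pandas as pd

# Illustrative column mixing NaN with explicit 'Missing'/'Unknown' markers
df = pd.DataFrame({"hosp_yn": ["Yes", "Missing", None, "Unknown", "No"]})

# NaN and the 'Missing' marker both become the single 'Unknown' value
df["hosp_yn"] = df["hosp_yn"].fillna("Unknown").replace({"Missing": "Unknown"})
```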

2.4 Correction of Values for Missing Probable Case

2.5 Drop pos_spec_dt for nulls and update hospital admission. Remove 'OTH' values from hosp_yn.

2.6 Drop cdc_report_dt as deprecated

2.7 Split Race Ethnicity

I decided that the ethnicity information was captured within race, so ethnicity is dropped.
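A sketch of the split, assuming the combined values follow a 'Race, Ethnicity' pattern (the example values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"race_ethnicity_combined": [
    "White, Non-Hispanic", "Black, Non-Hispanic", "Hispanic/Latino"]})

# Keep only the race portion; the ethnicity suffix is redundant for our model
df["race"] = (df["race_ethnicity_combined"]
              .str.split(",").str[0].str.strip())
df = df.drop(columns=["race_ethnicity_combined"])
```

Values without a comma (e.g. a bare 'Hispanic/Latino') pass through unchanged.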

2.8 Rerun Descriptive Stats

Some of these values are not sensible.

It is well publicised that the COVID-19 symptomatic period falls within 14 days of diagnosis. While it would need to be confirmed with a domain expert, my recommendation, given that 98% of values are 0 and that of the small volume of populated values most are less than 1, is that we add a field called symptomatic. If the onset date is populated and within 14 days of diagnosis, the patient is symptomatic; if it falls outside this window, the field is zeroed and set to non-symptomatic; and if today's date is within 14 days of the maximum case date, the outcome cannot yet be determined. All missing values are denoted as 'unknown'. We drop the post-14-day values.

All of our data is more than 14 days old.
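A sketch of the proposed symptomatic field under these rules (the dates are illustrative, and the 'today within 14 days of the max date' case is omitted since all of our data is older than that):

```python
import pandas as pd

df = pd.DataFrame({
    "cdc_case_earliest_dt": pd.to_datetime(["2020-03-01"] * 3),
    "onset_dt": pd.to_datetime(["2020-03-04", "2020-04-20", None]),
})

days_to_onset = (df["onset_dt"] - df["cdc_case_earliest_dt"]).dt.days

# Onset within 14 days of the case date -> symptomatic; later -> not;
# missing onset dates stay 'unknown' (NaN fails both comparisons)
df["symptomatic"] = "unknown"
df.loc[days_to_onset <= 14, "symptomatic"] = "yes"
df.loc[days_to_onset > 14, "symptomatic"] = "no"
```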

2.9 POS Value Correction and Feature Addition

2.10 Plot the Earliest Case Date

2.11 Rerun Describe

2.12 Save the Cleansed Dataset

3.0 and 4.0 Extension and Pairing Of Features

3.1 Read in the ADF

3.2 Extend ADF with day, month, year of diagnosis

3.3 The question asks to analyse and then extend.

I am going to extend first and then analyse.

Extension and Analysis Commentary

Extension Commentary

I elected to pair and plot all combinations of features within the dataset.

To extend the set, I created day, month, year, and workday features from the cdc_case date. This was primarily to help determine whether cases followed any trend in timing within the week, month, or year, which could be insightful. A confounding factor could be that certain areas operate on a rotating staff basis, in which case trends in deaths could point to further areas to investigate.
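The day/month/year/workday extension can be sketched as follows (2020-03-02 is a Monday and 2020-03-07 a Saturday; the feature names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"cdc_case_earliest_dt": pd.to_datetime(
    ["2020-03-02", "2020-03-07"])})

dt = df["cdc_case_earliest_dt"].dt
df["case_day"] = dt.day
df["case_month"] = dt.month
df["case_year"] = dt.year
df["case_workday"] = dt.dayofweek < 5   # Monday=0 ... Friday=4
```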

Adding on to my earlier analysis and inspection, I changed the onset date into a column highlighting the number of days after diagnosis that symptoms appeared. My initial hypothesis is that individuals who got tested and did not become symptomatic until later would have had a better expected outcome due to earlier intervention and treatment management, and that this could have predictive power in determining whether a patient was at risk of dying.

Finally, I added flags for whether demographic or medical data was missing for a particular record. Although I personally wished to avoid removing duplicates until the data is fed into an ML model, it will be necessary to experiment with different features being present or absent, given the high quantity of missing values within the dataset. These flags provide a convenient way to filter the dataset and focus on the rows where a full record is present, if needed.
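A sketch of these row-level flags. The column groupings and flag names are illustrative, and 'Unknown' is assumed to be the unified missing marker from the cleansing step.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["Male", "Unknown", "Female"],
    "medcond_yn": ["Unknown", "Yes", "No"],
})

demographic_cols = ["sex"]
medical_cols = ["medcond_yn"]

# Flag rows where any demographic / medical field is unknown
df["demo_missing"] = df[demographic_cols].eq("Unknown").any(axis=1)
df["med_missing"] = df[medical_cols].eq("Unknown").any(axis=1)

# Convenient filter down to fully populated records
complete = df[~(df["demo_missing"] | df["med_missing"])]
```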

Analysis

For the purpose of analysing pairs of features, beyond some of the analysis already conducted, I am electing to focus on plotting the target feature death_yn versus the other categorical columns. Other features of interest may be briefly discussed; however, the main focus will be on the death_yn feature against the others. Unfortunately, as the data is primarily categorical, the analysis focuses primarily on feature distribution.

Key Points:

One area of note is that a 100% stacked bar chart diminishes the importance of how prevalent features actually are, and fails to account for the fact that single instances have a larger impact on a smaller group. Because of this, the 100% stacked bar charts can give a highly misleading view of the data (although they can be useful for gaining a perspective on factors relevant to our model) and should be considered in relation to the stacked (but not 100% stacked) bar charts produced throughout the report.
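The point can be illustrated with a toy crosstab: the row-normalised view hides the fact that one group has far fewer cases than the other (the counts below are illustrative).

```python
import pandas as pd

df = pd.DataFrame({
    "age_group": ["80+", "80+", "20-29", "20-29", "20-29", "20-29"],
    "death_yn": ["Yes", "No", "No", "No", "No", "Yes"],
})

raw = pd.crosstab(df["age_group"], df["death_yn"])                 # counts
pct = pd.crosstab(df["age_group"], df["death_yn"], normalize="index")

# raw.plot(kind="bar", stacked=True)  # stacked counts preserve prevalence
# pct.plot(kind="bar", stacked=True)  # 100% stacked: prevalence is lost
```

Here the 80+ group shows a 50% death rate against 25% for 20-29, but it contains only two cases, which the 100% stacked view conceals.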